Abstract

Abstracting abstract machines has been proposed as a lightweight 
approach to designing sound and computable program analyses. 
The approach derives abstract interpreters from existing machine 
semantics and has been applied to a variety of languages with 
features widely considered difficult to analyze. Although sound 
analyzers are straightforward to build under this approach, they are 
also prohibitively inefficient. 

This article contributes a step-by-step process for going from 
a naive analyzer derived under the abstracting abstract machine 
approach to an efficient program analyzer. The end result of the 
process is a two to three order-of-magnitude improvement over 
the systematically derived analyzer, making it competitive with 
hand-optimized implementations that compute fundamentally less 
precise results. 

1. Introduction 

The abstracting abstract machines (AAM) approach [27, 29] to deriving
program analyses provides a systematic way of transforming
a programming language semantics in the form of an abstract ma- 
chine into a family of abstract interpreters. The approach parame- 
terizes these families with policies for regulating analytic precision. 
While flexible and robust, the AAM approach unfortunately yields 
analyzers with poor performance relative to hand-optimized ana- 
lyzers. Our work takes aim squarely at this "efficiency gap," and 
narrows it in an equally systematic way. 

By taking a machine-oriented view of computation, AAM 
makes it possible to design, verify, and implement program ana- 
lyzers for realistic language features typically considered difficult 
to model. The approach was originally applied to features such as 
higher-order functions, stack inspection, exceptions, laziness, first- 
class continuations, and garbage collection. It has since been used 
to verify actor- [7] and thread-based [19] parallelism and behavioral
contracts [26]; it has been used to model Coq [22], Dalvik [21],
Erlang [8], JavaScript [28], and Racket [26].

The primary strength of the approach is that abstract interpreters
can be easily derived through a small number of steps from existing
machine models. Since the relationships between abstract machines
and higher-level semantic models (such as definitional interpreters
[25], structured operational semantics [24], and reduction semantics
[11]) are well understood [5], it is possible to navigate from these
high-level semantic models to sound program analyzers in a
systematic way. Moreover, since these analyses so closely resemble
a language's interpreter, (a) implementing an analysis requires
little more than implementing an interpreter, (b) a single
implementation can serve as both an interpreter and analyzer, and (c)
verifying the correctness of the implementation is straightforward.

However, there is a considerable weakness with the approach:
an analyzer designed and implemented by following the AAM
recipe is prohibitively inefficient without both further
approximation and further implementation effort.

Figure 1. Factor improvements over the baseline analyzer for the
Vardoulakis and Shivers benchmark in terms of peak memory usage,
the rate of state transitions, and total analysis time. (Bigger is
better.) Each point is marked with the section that introduces the
optimization.

In this article, we develop a systematic approach to deriving the 
feasible implementation of an abstract-machine-based analyzer. 

2. At a glance 

This paper is organized in two halves: in the first, we start with a
quick review of the AAM approach to develop an analysis frame- 
work and then apply our step-by-step optimization techniques in 
the simplified setting of a core functional language. This allows us 
to explicate the optimizations with a minimal amount of inessen- 
tial technical overhead. In the second half, we scale this approach 
up to an analyzer for a realistic untyped, higher-order imperative 
language with a number of interesting features and then measure 
improvements across a suite of benchmarks. 

At each step during the initial presentation and development, 
we evaluate the implementation on a benchmark from Vardoulakis 
and Shivers [30] that tests distributivity of multiplication over
addition on Church numerals. For the step-by-step development, this
benchmark is particularly informative: 

1. it can be written in most modern programming languages, 

2. it was designed to stress an analyzer's ability to deal with 
complicated environment and control structure arising from the 
use of higher-order functions to encode arithmetic, and 

3. it proves to be the least improved benchmark of the complete
suite considered in section 6, and thus it serves as a good sanity
check and lower bound for each of the optimization techniques
considered.

[Figure 2 panels: (a) Baseline, (b) Lazy, (c) Compiled (& lazy)]

Figure 2. Example state graphs for the program above. Part (a)
shows the result of the baseline analyzer. It has long "corridor"
transitions and "diamond" subgraphs that fan out from
nondeterminism and fan in from joins. Part (b) shows the result of
performing nondeterminism lazily and thus avoids many of the diamond
subgraphs. Part (c) shows the result of abstract compilation that
removes interpretive overhead in the form of intermediate states, thus
minimizing the corridor transitions. The end result is a more
compact abstraction of the program that can be generated faster.

We start, in section 3, by developing an abstract interpreter
according to the AAM approach. Without further abstraction, the
analysis is exponential due to per-state store variance and thus
cannot analyze the example in a reasonable amount of time. In
section 4, we perform a further abstraction by widening the store.
The resulting analyzer sacrifices precision for speed and is able to
analyze the example in about 1 minute. This step is described by
Van Horn and Might [29, §3.5-6] and is necessary to make even
small examples feasible. We therefore take the widened interpreter
as the baseline for our evaluation.

Section 5 gives a series of simple abstractions and implementation
techniques that, in total, speed up the analysis by nearly a
factor of 500, dropping the analysis time to a fraction of a second.
Figure 1 shows the step-wise improvement of the analysis time for
this example.

The AAM approach, in essence, does the following: it takes 
a machine-based view of computation and turns it into a finitary
approximation by bounding the size of the store. With a limited 
address space, the store must map addresses to sets of values. 
Store updates are interpreted as joins, and store dereferences are 
interpreted by non-deterministic choice of an element from a set. 
The result of analyzing a program is a finite directed graph where 
nodes in the graph are (abstract) machine states and edges denote 
machine transitions between states. 
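
The store discipline just described can be illustrated with a minimal Python sketch. The names `join` and `lookup` and the dictionary-of-frozensets representation are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import Dict, FrozenSet, Hashable, Iterable

# Hypothetical representation: a store maps addresses to sets of values.
Store = Dict[Hashable, FrozenSet]


def join(store: Store, addr: Hashable, values: Iterable) -> Store:
    """Interpret a store update as a join: keep old values, add new ones."""
    updated = dict(store)
    updated[addr] = store.get(addr, frozenset()) | frozenset(values)
    return updated


def lookup(store: Store, addr: Hashable) -> FrozenSet:
    """Dereference: each element of the result is one possible choice."""
    return store.get(addr, frozenset())


store = join({}, "x", {1})
store = join(store, "x", {2})      # the same abstract address is reused
choices = sorted(lookup(store, "x"))
```

Because the address "x" is reused, a later dereference has to consider both values, which is where the non-deterministic branching in the state graph comes from.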
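$$
\begin{array}{rcl}
e & = & \mathsf{var}^\ell(x) \mid \mathsf{lit}^\ell(l) \mid \mathsf{lam}^\ell(x, e) \mid \mathsf{app}^\ell(e, e) \mid \mathsf{if}^\ell(e, e, e) \\
l & = & z \mid b \mid o \\
z & = & 0 \mid 1 \mid -1 \mid \ldots \\
b & = & \mathsf{tt} \mid \mathsf{ff} \\
o & = & \mathsf{zero?} \mid \mathsf{add1} \mid \mathsf{sub1}
\end{array}
$$

Figure 3. Syntax of ISWIM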



The techniques we propose for optimizing analysis fall into the 
following categories: 

1. generate fewer states by avoiding the eager exploration of
non-deterministic choices that will later collapse into a single join
point. We accomplish this by applying lazy evaluation techniques
so that non-determinism is evaluated by need.

2. generate fewer states by avoiding unnecessary, intermediate 
states of a computation. We accomplish this by applying com- 
pilation techniques from functional languages to avoid interpre- 
tive overhead in the machine transition system. 

3. generate states faster. We accomplish this by better algorithm 
design in the fixed-point computation we use to generate state 
graphs. 

Figure 2 shows the effect of (1) and (2) for a small example due
to Earl et al. [9]. By generating significantly fewer states at a
significantly faster rate, we are able to achieve large performance
improvements in terms of both time and space.

Section 6 describes the evaluation of each optimization technique
applied to an implementation supporting a more realistic set
of features, including mutation, first-class control, compound data,
a full numeric tower and many more forms of primitive data and
operations. We evaluate this implementation against a set of
benchmark programs drawn from the literature. For all benchmarks, the
optimized analyzer outperforms the baseline by at least two to
three orders of magnitude.

Section 7 relates this work to the literature and section 8
concludes.

3. An analyzer for ISWIM 

In this section, we give a brief review of the AAM approach by
defining a sound analytic framework for a core higher-order
functional language: Landin's ISWIM [15]. In the subsequent sections,
we will explore optimizations for the analyzer in this simplified
setting, but scaling these techniques to realistic languages is
straightforward and has been done for the analyzer evaluated in
section 6.

ISWIM is a family of programming languages parameterized
by a set of base values and operations. To make things concrete, we
consider a member of the ISWIM family with integers, booleans,
and a few operations. Figure 3 defines the (abstract) syntax of
ISWIM. It includes variables, literals (either integers, booleans, or
operations), $\lambda$-expressions for defining procedures, procedure
applications, and conditionals. Expressions carry a label, $\ell$, which
is drawn from an unspecified set and denotes the source location of
the expression; labels are used to disambiguate distinct, but
syntactically identical pieces of syntax. We omit the label annotation
in contexts where it is irrelevant.

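The semantics are defined in terms of a machine model. The
machine components are defined in figure 4 and figure 5 defines
the transition relation. The evaluation of a program is defined as
the set of states reachable by the reflexive, transitive closure of the
machine transition relation. The machine is a very slight variation
on a standard abstract machine for ISWIM in "eval, continue,
apply" form [5]. It can be systematically derived from a definitional
interpreter through a continuation-passing style transformation and
defunctionalization, or from a structural operational semantics
using the refocusing construction of Danvy and Nielsen [6].

$$
\begin{array}{lrcl}
\text{Values} & v & = & \mathsf{clos}(x, e, \rho) \mid l \mid \kappa \\
\text{States} & \varsigma & = & \mathsf{ev}^\delta(e, \rho, \sigma, \kappa) \mid \mathsf{co}(\kappa, v, \sigma) \mid \mathsf{ap}^\delta_\ell(v, v, \sigma, \kappa) \\
\text{Continuations} & \kappa & = & \mathsf{mt} \mid \mathsf{fn}^\delta_\ell(v, a) \mid \mathsf{ar}^\delta_\ell(e, \rho, a) \mid \mathsf{if}^\delta(e, e, \rho, a) \\
\text{Environments} & \rho & \in & Var \to Addr \\
\text{Stores} & \sigma & \in & Addr \to \wp(Value)
\end{array}
$$

Figure 4. Abstract machine components

$$
eval(e) = \{\varsigma \mid \mathsf{ev}^\epsilon(e, \emptyset, \emptyset, \mathsf{mt}) \longmapsto^{*} \varsigma\} \text{ where}
$$

$$
\begin{array}{rcl}
\mathsf{ev}^\delta(\mathsf{var}(x), \rho, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, v, \sigma) \text{ where } v \in \sigma(\rho(x)) \\
\mathsf{ev}^\delta(\mathsf{lit}(l), \rho, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, l, \sigma) \\
\mathsf{ev}^\delta(\mathsf{lam}(x, e), \rho, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, \mathsf{clos}(x, e, \rho), \sigma) \\
\mathsf{ev}^\delta(\mathsf{app}^\ell(e_0, e_1), \rho, \sigma, \kappa) & \longmapsto & \mathsf{ev}^\delta(e_0, \rho, \sigma', \mathsf{ar}^\delta_\ell(e_1, \rho, a)) \\
 & & \text{where } a, \sigma' = push^\delta_\ell(\sigma, \kappa) \\
\mathsf{ev}^\delta(\mathsf{if}^\ell(e_0, e_1, e_2), \rho, \sigma, \kappa) & \longmapsto & \mathsf{ev}^\delta(e_0, \rho, \sigma', \mathsf{if}^\delta(e_1, e_2, \rho, a)) \\
 & & \text{where } a, \sigma' = push^\delta_\ell(\sigma, \kappa) \\
\mathsf{co}(\mathsf{mt}, v, \sigma) & \longmapsto & \mathsf{ans}(\sigma, v) \\
\mathsf{co}(\mathsf{ar}^\delta_\ell(e, \rho, a), v, \sigma) & \longmapsto & \mathsf{ev}^\delta(e, \rho, \sigma, \mathsf{fn}^\delta_\ell(v, a)) \\
\mathsf{co}(\mathsf{fn}^\delta_\ell(u, a), v, \sigma) & \longmapsto & \mathsf{ap}^\delta_\ell(u, v, \sigma, \kappa) \text{ where } \kappa \in \sigma(a) \\
\mathsf{co}(\mathsf{if}^\delta(e_0, e_1, \rho, a), \mathsf{tt}, \sigma) & \longmapsto & \mathsf{ev}^\delta(e_0, \rho, \sigma, \kappa) \text{ where } \kappa \in \sigma(a) \\
\mathsf{co}(\mathsf{if}^\delta(e_0, e_1, \rho, a), \mathsf{ff}, \sigma) & \longmapsto & \mathsf{ev}^\delta(e_1, \rho, \sigma, \kappa) \text{ where } \kappa \in \sigma(a) \\
\mathsf{ap}^\delta_\ell(\mathsf{clos}(x, e, \rho), v, \sigma, \kappa) & \longmapsto & \mathsf{ev}^{\delta'}(e, \rho', \sigma', \kappa) \\
 & & \text{where } \rho', \sigma', \delta' = bind^\delta_\ell(\rho, \sigma, x, v) \\
\mathsf{ap}^\delta_\ell(o, v, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, v', \sigma) \text{ where } v' \in \Delta(o, v)
\end{array}
$$

Figure 5. Abstract abstract machine for ISWIM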

Compared with the standard machine semantics, this definition 
is different in the following ways, which make it abstractable as a 
program analyzer: 

• the store maps addresses to sets of values, not single values, 

• continuations are heap-allocated, not stack-allocated, 

• there are "contour values" (written $\delta$) and syntax labels ($\ell$)
threaded through the computation, and

• the machine is implicitly parameterized by the functions push,
bind, and $\Delta$.



Concrete interpretation To characterize concrete interpretation,
set the implicit parameters of the relation given in figure 5 as
follows:

$$
\begin{array}{rcl}
push^\delta_\ell(\sigma, \kappa) & = & a, \sigma \sqcup [a \mapsto \{\kappa\}] \text{ where } a \notin \sigma \\
bind^\delta_\ell(\rho, \sigma, x, v) & = & \rho[x \mapsto a], \sigma \sqcup [a \mapsto \{v\}] \text{ where } a \notin \sigma
\end{array}
$$

The resulting relation is non-deterministic in its choice of
addresses; however, it must always choose a fresh address when
allocating a continuation or variable binding. If we consider machine
states equivalent up to consistent renaming, this relation defines a
deterministic machine. (The relation is really a function.)

The interpretation of primitive operations is defined by setting
$\Delta$ as follows:

$$
\begin{array}{ll}
z + 1 \in \Delta(\mathsf{add1}, z) & z - 1 \in \Delta(\mathsf{sub1}, z) \\
\mathsf{tt} \in \Delta(\mathsf{zero?}, 0) & \mathsf{ff} \in \Delta(\mathsf{zero?}, z) \text{ if } z \neq 0
\end{array}
$$

Abstract interpretation To characterize abstract interpretation,
set the implicit parameters just as above, but drop the $a \notin \sigma$
condition. This family of interpreters is also non-deterministic in
choices of addresses, but it is free to choose addresses that are
already in use. Consequently, the machines may be non-deterministic
when multiple values reside in a store location.

It is important to recognize from this definition that any
allocation strategy is an abstract interpretation [20]. In particular,
concrete interpretation is a kind of abstract interpretation. So is an
interpretation that allocates a single cell into which all bindings and
continuations are stored. On the one hand is an abstract
interpretation that is non-computable and gives only the ground truth
of a program's behavior; on the other is an abstract interpretation
that is easy to compute but gives little information. Useful program
analyses lie somewhere in between and can be characterized by their
choice of address representation and allocation strategy. Uniform
k-CFA [23] is one such analysis.

Uniform k-CFA To characterize uniform k-CFA, set the allocation
strategy as follows, for a fixed constant k:

$$
\begin{array}{rcl}
push^\delta_\ell(\sigma, \kappa) & = & \ell\delta, \sigma \sqcup [\ell\delta \mapsto \{\kappa\}] \\
bind^\delta_\ell(\rho, \sigma, x, v) & = & \rho[x \mapsto a], \sigma \sqcup [a \mapsto \{v\}], \delta' \\
 & & \text{where } \delta' = \lfloor \ell\delta \rfloor_k, \quad a = x\delta' \\
\lfloor \delta \rfloor_0 & = & \epsilon \\
\lfloor \ell\delta \rfloor_{k+1} & = & \ell \lfloor \delta \rfloor_k
\end{array}
$$

where $\sqcup$ on stores is a point-wise lifting of $\cup$:
$\sigma \sqcup \sigma' = \lambda a.\, \sigma(a) \cup \sigma'(a)$.
The $\lfloor \cdot \rfloor_k$ notation denotes the truncation of a list
of symbols to the leftmost k symbols.
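
As a rough illustration of the uniform k-CFA allocation strategy, the following Python sketch models contours as tuples of labels. The function names, the argument order, and the tuple encoding of addresses are assumptions for illustration only.

```python
def join(store, addr, values):
    """Point-wise join of new values into one store entry."""
    updated = dict(store)
    updated[addr] = store.get(addr, frozenset()) | frozenset(values)
    return updated


def truncate(contour, k):
    """Keep the leftmost k labels of a contour (a tuple of labels)."""
    return contour[:k]


def push(label, contour, store, kont):
    """Allocate the continuation at the address (label, *contour)."""
    addr = (label,) + contour
    return addr, join(store, addr, {kont})


def bind(label, contour, k, env, store, var, value):
    """Bind var at an address built from the truncated new contour."""
    new_contour = truncate((label,) + contour, k)
    addr = (var,) + new_contour
    new_env = dict(env)
    new_env[var] = addr
    return new_env, join(store, addr, {value}), new_contour


# With k = 0, both bindings of x share one address and are joined.
env0, store0, _ = bind("l1", (), 0, {}, {}, "x", 1)
env0, store0, _ = bind("l2", (), 0, env0, store0, "x", 2)

# With k = 1, the call-site label keeps the two bindings apart.
env1, store1, _ = bind("l1", (), 1, {}, {}, "x", 1)
env1, store1, _ = bind("l2", (), 1, env1, store1, "x", 2)
```

The two runs show the precision knob: a larger k separates more bindings at the cost of a larger address space.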

All that remains is the interpretation of primitives. For abstract
interpretation, we set $\Delta$ to the function that returns $\mathbb{Z}$
on all inputs, a symbolic value we interpret as denoting the set of
all integers.

At this point, we have abstracted the original machine to one 
which has a finite state space for any given program, and thus forms 
the basis of a sound, computable program analyzer for ISWIM. 

4. Reduction semantics to baseline analyzer 

The uniform k-CFA allocation strategy would make eval in figure 5
a computable abstraction of reachable states, but not an efficient
one. Neither AAM nor we recommend this strategy.
Through this and the following section, we will explain a succession
of approximations to reach the baseline analysis. We'll compare
performance at each stage to identify the criticality of each
optimization. We ground this journey by first formulating the
analysis in terms of a classic fixed-point computation.

4.1 Static analysis as fixed-point computation 

Conceptually, the AAM approach calls for computing an analysis
as a graph exploration: (1) start with an initial state, and (2)
compute the transitive closure of the transition relation from that
state.

We can cast this exploration process in terms of a fixed-point
calculation. Given the initial state $\varsigma_0$ and the transition
relation $\longmapsto$, we define the global transfer function:

$$
F_{\varsigma_0} : \wp(State) \to \wp(State).
$$

Internally, this global transfer function computes the successors of
all supplied states, and then includes the initial state:

$$
F_{\varsigma_0}(S) = \{\varsigma_0\} \cup \{\varsigma' \mid \varsigma \in S \text{ and } \varsigma \longmapsto \varsigma'\}.
$$

Then, the evaluator for the analysis computes the least fixed-point
of the global transfer function:

$$
eval(e) = \mathrm{lfp}(F_{\varsigma_0}),
$$

where $\varsigma_0 = \mathsf{ev}^\epsilon(e, \emptyset, \emptyset, \mathsf{mt})$.
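
The fixed-point formulation can be transcribed almost directly. The Python sketch below assumes a hypothetical `step` function that returns the set of successors of a state; it performs naive Kleene iteration from the empty set and makes no attempt at efficiency.

```python
def transfer(initial, step, states):
    """Global transfer function: the initial state plus all successors."""
    return {initial} | {succ for s in states for succ in step(s)}


def lfp(initial, step):
    """Iterate the transfer function from the empty set until it stabilizes."""
    current = set()
    while True:
        following = transfer(initial, step, current)
        if following == current:
            return current
        current = following


# Toy transition relation on integers, standing in for machine states.
reachable = lfp(0, lambda n: {(n + 2) % 6})
```

Each iteration recomputes the successors of every state seen so far, which is one reason this direct formulation is slow in practice.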

To conduct this naive exploration on the Vardoulakis and Shiv- 
ers example would require considerable time. Even though the state 
space is finite, it is exponential in the size of the program. Even with 
k = 0, there are exponentially many stores in the AAM framework. 

In the next subsection, we'll fix this with a widening and reach 
polynomial (albeit of a high degree) complexity. This widening 
effectively lifts the store out of individual states to create a single, 
global shared store for all. 

4.2 Store widening 

A common technique to accelerate convergence in flow analyses 
is to share a common, global store. To retain soundness, this store 
grows monotonically. Formally, we can cast this optimization as 
a second abstraction or as the application of a widening operator 
during the fixed-point iteration. In the ISWIM language, such a 
widening makes 0-CFA quartic in the size of the program. Thus,
in one step, complexity drops from an intractable exponential to a
merely daunting polynomial.

Since we can cast this optimization as a widening, there is no 
need to change the transition relation itself. Rather, what changes is 
the structure of the fixed-point iteration. In each pass, the algorithm 
will collect all newly produced stores and join them together. Then, 
before each transition, it will install this joined store into the
current state.

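To describe this process, we'll refactor the transition relation so
that it operates on a pair of a set of contexts ($C$) and a store
($\sigma$). A context includes all non-store components, e.g., the
expression, the environment and the stack. The refactored relation,
$\widehat{\longmapsto}$, becomes:

$$
\begin{array}{rcl}
(C, \sigma) & \widehat{\longmapsto} & (C', \sigma') \\
\text{where } C' & = & \{c' \mid wn(c, \sigma) \longmapsto wn(c', \sigma^c),\ c \in C\} \\
\sigma' & = & \bigsqcup \{\sigma^c \mid wn(c, \sigma) \longmapsto wn(c', \sigma^c),\ c \in C\} \\
wn(\mathsf{ev}(e, \rho, \kappa), \sigma) & = & \mathsf{ev}(e, \rho, \sigma, \kappa) \\
wn(\mathsf{co}(\kappa, v), \sigma) & = & \mathsf{co}(\kappa, v, \sigma) \\
wn(\mathsf{ap}(u, v, \kappa), \sigma) & = & \mathsf{ap}(u, v, \sigma, \kappa) \\
wn(\mathsf{ans}(v), \sigma) & = & \mathsf{ans}(\sigma, v)
\end{array}
$$

In effect, the new store is computed as the least upper bound of all
subsequent stores.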
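
A minimal Python sketch of one widened step follows. It assumes a hypothetical `step(context, store)` that yields `(context', store')` pairs, and it seeds the joined store with the current one, which matches the formal definition whenever every successor store extends the current store.

```python
def widen_step(contexts, store, step):
    """Step every context against the shared store and join the results."""
    new_contexts = set()
    new_store = dict(store)
    for c in contexts:
        for succ, succ_store in step(c, store):
            new_contexts.add(succ)
            for addr, values in succ_store.items():
                new_store[addr] = new_store.get(addr, frozenset()) | values
    return new_contexts, new_store


def analyze(initial, step):
    """Iterate the widened step until neither contexts nor store grow."""
    contexts, store = {initial}, {}
    while True:
        succs, joined = widen_step(contexts, store, step)
        succs |= contexts
        if succs == contexts and joined == store:
            return contexts, store
        contexts, store = succs, joined


def toy_step(c, store):
    # Hypothetical relation: context c writes c to address "a", moves to c + 1.
    if c < 2:
        yield c + 1, {**store, "a": store.get("a", frozenset()) | {c}}


final_contexts, final_store = analyze(0, toy_step)
```

Only one store is ever kept, so the number of configurations explored depends on the number of contexts rather than on the number of context and store pairs.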

4.3 Store-allocated results 

The final approximation we make to get to our point of departure is
store-allocating results of application sub-expressions. The AAM
approach stops at the previous optimization. However, the fn
continuation stores a value, and this makes the space of continuations
quadratic rather than linear in the size of the program, even for
a monovariant analysis like 0CFA. Having the space of continuations
grow linearly with the size of the program will drop the overall
complexity to cubic (as expected).

To achieve this linearity for continuations, we allocate an
address for the value position when we create the continuation. This
address and the tail address are both determined by the label of the
application point, so the space becomes linear and the overall
complexity drops to cubic. This is a critical abstraction in languages
with n-ary functions, since otherwise the continuation space grows
super-exponentially.

If we specialize to 0CFA, the evaluation rules become:

$$
\begin{array}{rcl}
\mathsf{ev}(\mathsf{app}^\ell(e_0, e_1), \rho, \sigma, \kappa) & \longmapsto & \mathsf{ev}(e_0, \rho, \sigma \sqcup [\ell \mapsto \{\kappa\}], \mathsf{ar}(e_1, \rho, \ell)) \\
\mathsf{co}(\mathsf{ar}(e, \rho, \ell), v, \sigma) & \longmapsto & \mathsf{ev}(e, \rho, \sigma \sqcup [\ell_f \mapsto \{v\}], \mathsf{fn}(\ell_f, \ell)) \\
\mathsf{co}(\mathsf{fn}(\ell_f, \ell), v, \sigma) & \longmapsto & \mathsf{ap}_\ell(u, \ell_a, \sigma \sqcup [\ell_a \mapsto \{v\}], \kappa) \\
 & & \text{where } \kappa \in \sigma(\ell),\ u \in \sigma(\ell_f) \\
\mathsf{ap}_\ell(\mathsf{clos}(x, e, \rho), \ell_a, \sigma, \kappa) & \longmapsto & \mathsf{ev}(e, \rho[x \mapsto x], \sigma \sqcup [x \mapsto \sigma(\ell_a)], \kappa) \\
\mathsf{ap}_\ell(o, \ell_a, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, v', \sigma) \\
 & & \text{where } v' \in \Delta(o, v),\ v \in \sigma(\ell_a)
\end{array}
$$

5. Implementation techniques 

In this section, we discuss the optimizations for abstract interpreters 
that yield our ultimate performance gains. We have two broad 
categories of these optimizations: (1) transition elimination and (2) 
pragmatic improvement. The transition-elimination optimizations 
reduce the overall number of transitions made by the analyzer by 
performing: 

1. lazy non-determinism; 

2. abstract compilation; and 

3. uniform literal approximation. 

The pragmatic improvements reduce overhead and trade space for 
time by utilizing: 

1. store deltas; 

2. timestamped stores; and 

3. preallocated data structures. 

Some techniques preserve the precision of the underlying anal- 
ysis, and others do not. For any technique that loses precision, we 
will discuss the design rationale for the move. 

5.1 Lazy non-determinism 

Tracing the execution of the analysis reveals an immediate short- 
coming: there is a high degree of branching and merging in the 
exploration. Surveying this branching has no benefit for precision. 
For example, in a function application, (f x y), where f, x and
y each have several values, each argument evaluation induces
two-way branching, only to be ultimately joined back together in their
respective application positions. Transition patterns of this shape
litter the state-graph:




To avoid the spurious forking and joining, we delay the
non-determinism until and unless it is needed in strict contexts (such
as the guard for an if, a called procedure, or a numerical primitive
application). Doing so collapses these forks and joins into a linear
sequence of states:



In case it's unclear, this shift does not change the concrete
semantics of the language to be lazy. Rather, it abstracts over
transitions that the original non-deterministic semantics steps through.
We say the abstraction is lazy because it delays dereferencing an
address until its contents are needed as values in the semantics. It
does not change the execution order that leads to the values that are
stored in the address.

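We introduce a new kind of value, $\mathsf{addr}(a)$, that represents a
delayed lookup of $a$. The following rules highlight the changes to
the semantics:

$$
\begin{array}{rcl}
force & : & Store \times Value \to \wp(Value) \\
force(\sigma, \mathsf{addr}(a)) & = & \sigma(a) \\
force(\sigma, v) & = & \{v\} \\
ext(\sigma, a, v) & = & \sigma \sqcup [a \mapsto force(\sigma, v)] \\
\mathsf{ev}^\delta(\mathsf{var}(x), \rho, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, \mathsf{addr}(\rho(x)), \sigma) \\
\mathsf{co}(\mathsf{ar}^\delta_\ell(e, \rho, a), v, \sigma) & \longmapsto & \mathsf{ev}^\delta(e, \rho, ext(\sigma, a_f, v), \mathsf{fn}^\delta_\ell(a_f, a)) \\
bind^\delta_\ell(\rho, \sigma, x, v) & = & \rho[x \mapsto a], ext(\sigma, a, v), \delta' \\
 & & \text{where } \delta' = \lfloor \ell\delta \rfloor_k, \quad a = x\delta'
\end{array}
$$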
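
The force and ext operations can be sketched in Python as follows. The `Addr` wrapper class and the dictionary store are illustrative assumptions rather than the authors' representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Addr:
    """A delayed lookup: stands for every value stored at `target`."""
    target: object


def force(store, value):
    """Resolve a delayed lookup to its set of values; wrap anything else."""
    if isinstance(value, Addr):
        return store.get(value.target, frozenset())
    return frozenset({value})


def ext(store, addr, value):
    """Join the forced value set into the store at addr."""
    updated = dict(store)
    updated[addr] = store.get(addr, frozenset()) | force(store, value)
    return updated


store = {"x": frozenset({1, 2, 3})}
# One transition copies all three values; no three-way fork is needed.
store2 = ext(store, "y", Addr("x"))
```

The whole value set moves from "x" to "y" in a single step, which is the collapse of forks and joins described above.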

We have two choices for how to implement lazy non-determinism. 



Option 1: Lose precision; simplify implementation This seman- 
tics introduces a subtle precision difference over the baseline. Con- 
sider a configuration where a reference to a variable and a binding 
of a variable will happen in one step. With laziness, the reference 
will mean the original binding(s) of the variable or the new one, 
because the actual store lookup is delayed one step (i.e. laziness is 
administrative). Without laziness, the reference will fan out to all 
the bindings of the variable before the new binding happens and 
thus might have an observable precision difference. 

Option 2: Regain precision; complicate implementation The
administrative nature of laziness means that we could remove the loss
in precision by duplicating the reduction relation to specialize
variable lookup. This works since the semantics of ISWIM with
store-allocated results consumes the value component of states in
one step. This is not the case for semantics that replicate the value
component across reductions, say for popping off exception handler
frames. Further convolution is needed to remove the administrative
nature of laziness in these semantics. Due to the increase
in conceptual complexity for negligible benefit, we decided against
this approach.

Our choice: option 1 The configurations that lead to precision 
loss happen too rarely to warrant the significant increase in time 
and memory needed for this eager non-determinism. Indeed, were 
the variable reference a step later and another binding not made in 
that step, the results of the two approaches are the same. 

5.2 Abstract compilation 

The prior optimization saved time by doing the same amount of rea- 
soning as before but in fewer transitions. We can exploit the same 
idea — same reasoning, fewer transitions — with abstract compila- 
tion. Abstract compilation is precision-preserving and transforms 
complex expressions whose abstract evaluation is deterministic 
into "abstract bytecodes." The abstract interpreter then does in one 
transition what previously took many. In short, abstract compilation 
eliminates unnecessary allocation, deallocation and branching. 

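The following example illustrates the essence of the abstract
compilation effect. The expression

$$
\mathsf{app}(\mathsf{app}(\mathsf{app}(x, e_1), e_2), e_3)
$$

makes the following transitions:

$$
\begin{array}{rlr}
 & \mathsf{ev}(\mathsf{app}(\mathsf{app}(\mathsf{app}(x, e_1), e_2), e_3), \rho, \sigma, \kappa) & (1) \\
\longmapsto & \mathsf{ev}(\mathsf{app}(\mathsf{app}(x, e_1), e_2), \rho, \sigma_1, \mathsf{ar}(e_3, \rho, a_1)) & (2) \\
\longmapsto & \mathsf{ev}(\mathsf{app}(x, e_1), \rho, \sigma_2, \mathsf{ar}(e_2, \rho, a_2)) & (3) \\
\longmapsto & \mathsf{ev}(x, \rho, \sigma_3, \mathsf{ar}(e_1, \rho, a_3)) & (4) \\
\longmapsto & \mathsf{co}(\mathsf{ar}(e_1, \rho, a_3), v, \sigma_3) \text{ where } v \in \sigma_3(\rho(x)) & (5)
\end{array}
$$

where $\sigma_3 = \sigma \sqcup [a_1 \mapsto \{\kappa\}] \sqcup [a_2 \mapsto \{\mathsf{ar}(e_3, \rho, a_1)\}] \sqcup [a_3 \mapsto \{\mathsf{ar}(e_2, \rho, a_2)\}]$.

The compilation step converts expressions into functions that
expect the other components of the ev state. Its definition in figure
6 shows close similarity to the rules for interpreting ev states. The
next step is to change reduction rules that create ev states to instead
call these functions. Figure 7 shows the modified reduction relation.

$$
\begin{array}{rcl}
[\![\_]\!] & : & Expr \to Env \times Store \times Kont \to State \\
[\![\mathsf{var}(x)]\!] & = & \lambda(\rho, \sigma, \kappa).\ \mathsf{co}(\kappa, \mathsf{addr}(\rho(x)), \sigma) \\
[\![\mathsf{lit}(l)]\!] & = & \lambda(\rho, \sigma, \kappa).\ \mathsf{co}(\kappa, l, \sigma) \\
[\![\mathsf{lam}(x, e)]\!] & = & \lambda(\rho, \sigma, \kappa).\ \mathsf{co}(\kappa, \mathsf{clos}(x, [\![e]\!], \rho), \sigma) \\
[\![\mathsf{app}^\ell(e_0, e_1)]\!] & = & \lambda^\delta(\rho, \sigma, \kappa).\ [\![e_0]\!]^\delta(\rho, \sigma', \mathsf{ar}^\delta_\ell([\![e_1]\!], \rho, a)) \\
 & & \text{where } a, \sigma' = push^\delta_\ell(\sigma, \kappa) \\
[\![\mathsf{if}^\ell(e_0, e_1, e_2)]\!] & = & \lambda^\delta(\rho, \sigma, \kappa).\ [\![e_0]\!]^\delta(\rho, \sigma', \mathsf{if}^\delta([\![e_1]\!], [\![e_2]\!], \rho, a)) \\
 & & \text{where } a, \sigma' = push^\delta_\ell(\sigma, \kappa)
\end{array}
$$

Figure 6. Abstract compilation

$$
eval(e) = \{\varsigma \mid [\![e]\!]^\epsilon(\emptyset, \emptyset, \mathsf{mt}) \longmapsto^{*} \varsigma\} \text{ where}
$$

$$
\begin{array}{rcl}
\mathsf{co}(\mathsf{mt}, v, \sigma) & \longmapsto & \mathsf{ans}(\sigma, v) \\
\mathsf{co}(\mathsf{ar}^\delta_\ell(k, \rho, a), v, \sigma) & \longmapsto & k^\delta(\rho, \sigma, \mathsf{fn}^\delta_\ell(v, a)) \\
\mathsf{co}(\mathsf{fn}^\delta_\ell(u, a), v, \sigma) & \longmapsto & \mathsf{ap}^\delta_\ell(u, v, \sigma, \kappa) \text{ where } \kappa \in \sigma(a) \\
\mathsf{co}(\mathsf{if}^\delta(k_0, k_1, \rho, a), \mathsf{tt}, \sigma) & \longmapsto & k_0^\delta(\rho, \sigma, \kappa) \text{ where } \kappa \in \sigma(a) \\
\mathsf{co}(\mathsf{if}^\delta(k_0, k_1, \rho, a), \mathsf{ff}, \sigma) & \longmapsto & k_1^\delta(\rho, \sigma, \kappa) \text{ where } \kappa \in \sigma(a) \\
\mathsf{ap}^\delta_\ell(\mathsf{clos}(x, k, \rho), v, \sigma, \kappa) & \longmapsto & k^{\delta'}(\rho', \sigma', \kappa) \\
 & & \text{where } \rho', \sigma', \delta' = bind^\delta_\ell(\rho, \sigma, x, v) \\
\mathsf{ap}^\delta_\ell(o, v, \sigma, \kappa) & \longmapsto & \mathsf{co}(\kappa, v', \sigma) \text{ where } v' \in \Delta(o, v)
\end{array}
$$

Figure 7. Abstract abstract machine for compiled ISWIM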
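
To give a feel for abstract compilation, the Python sketch below compiles a small expression tree into closures ahead of time. The tuple encoding of expressions and states, the monovariant push that uses the label as the address, and the restriction to three expression forms are all simplifying assumptions.

```python
def compile_expr(expr):
    """Turn an expression into a function of (env, store, kont)."""
    tag = expr[0]
    if tag == "lit":
        value = expr[1]
        return lambda env, store, kont: ("co", kont, value, store)
    if tag == "var":
        name = expr[1]
        # env is a tuple of (name, address) pairs so that it stays hashable.
        return lambda env, store, kont: (
            "co", kont, ("addr", dict(env)[name]), store)
    if tag == "app":
        label = expr[1]
        fun, arg = compile_expr(expr[2]), compile_expr(expr[3])

        def run(env, store, kont):
            # Monovariant push: the label itself serves as the address.
            pushed = dict(store)
            pushed[label] = store.get(label, frozenset()) | {kont}
            return fun(env, pushed, ("ar", arg, env, label))

        return run
    raise ValueError(f"unsupported expression: {tag}")


program = ("app", "l1", ("app", "l2", ("var", "x"), ("lit", 1)), ("lit", 2))
code = compile_expr(program)            # syntax dispatch happens once, here
state = code((("x", "ax"),), {}, "mt")  # one call reaches the co state
```

Calling the compiled code goes straight to the co state for x; the intermediate ev states for the nested applications are never materialized.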

5.3 Locally log-based store deltas 

Every step the analysis makes for the above techniques requires 
joining large stores together. Not every step will modify all ad- 
dresses of the store, so joining entire stores is wasteful in terms of 
memory and time. We can instead log store changes and replay the 
change log on the full store after all steps have completed. This uses 
far fewer join operations, leading to less overhead, and is precision- 
preserving. 

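We represent change logs as
$\xi \in Store\Delta = (Addr \times \wp(Storeable))^{*}$.
Each $\sigma \sqcup [a \mapsto vs]$ becomes a log addition
$(a, vs){:}\xi$, where $\xi$ begins empty ($\epsilon$) for each step.
Applying the changes to the full store is straightforward:

$$
\begin{array}{rcl}
replay & : & Store\Delta \times Store \to Store \\
replay(\epsilon, \sigma) & = & \sigma \\
replay((a, vs){:}\xi, \sigma) & = & replay(\xi, \sigma \sqcup [a \mapsto vs])
\end{array}
$$

The transition relation is identical except for the addition of this
change log. Lookups will never rely on the change log, and can use
the originally supplied store unmodified:

$$
\begin{array}{rcl}
(\mathsf{ap}^\delta_\ell(\mathsf{clos}(x, e, \rho), v, \kappa), \sigma, \xi) & \longmapsto & (\mathsf{ev}^{\delta'}(e, \rho', \kappa), \xi') \\
 & & \text{where } \rho', \xi', \delta' = bind^\delta_\ell(\rho, \sigma, \xi, x, v) \\
bind^\delta_\ell(\rho, \sigma, \xi, x, v) & = & \rho[x \mapsto a], (a, force(\sigma, v)){:}\xi, \delta' \\
 & & \text{where } \delta' = \lfloor \ell\delta \rfloor_k, \quad a = x\delta'
\end{array}
$$

Compilation changes to additionally take a $\xi$ component, so the
above rule's right hand side would instead be
$k^{\delta'}(\rho', \sigma, \xi', \kappa)$,
where $k = [\![e]\!]$ would be in the closure.

We also lift $\longmapsto$ to accommodate this asymmetry in the
input and output. For each state that is stepped, we feed the output
changes to the next so that all changes get accumulated:

$$
\begin{array}{rcl}
(cs, \sigma) & \longmapsto & (cs \cup cs', replay(\xi, \sigma)) \\
 & & \text{where } (cs', \xi) = step^{*}(\emptyset, cs, \epsilon) \\
step^{*}(S, \emptyset, \xi) & = & (S, \xi) \\
step^{*}(S, \{c\} \cup cs, \xi) & = & step^{*}(S \cup cs^{*}, cs, \xi^{*}) \\
cs^{*} & = & \{c' \mid (c, \sigma, \xi) \longmapsto (c', \xi^c)\} \\
\xi^{*} & = & concat\{\xi^c \mid (c, \sigma, \xi) \longmapsto (c', \xi^c)\}
\end{array}
$$

Here $concat : \wp(Store\Delta) \to Store\Delta$ flattens the lists of
changes to one; the order in which it appends does not matter:

$$
\begin{array}{rcl}
concat(\emptyset) & = & \epsilon \\
concat(\{\xi\} \cup D) & = & append(\xi, concat(D))
\end{array}
$$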
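
The log-and-replay idea can be sketched in Python as follows. As a simplifying assumption, the hypothetical `step(context, store)` returns only the changes made by that one transition instead of threading the accumulated log through each step; because joins are idempotent and commutative, replaying the concatenated changes gives the same store.

```python
def replay(log, store):
    """Apply every logged (address, values) pair to the store as a join."""
    updated = dict(store)
    for addr, values in log:
        updated[addr] = updated.get(addr, frozenset()) | frozenset(values)
    return updated


def step_all(contexts, store, step):
    """Step each context against the unmodified store, then replay once."""
    successors, log = set(), []
    for c in contexts:
        for succ, changes in step(c, store):   # lookups see the old store
            successors.add(succ)
            log.extend(changes)
    return contexts | successors, replay(log, store)


def toy_step(c, store):
    # Hypothetical relation: context c logs a write of c to address "a".
    if c < 2:
        yield c + 1, [("a", {c})]


contexts1, store1 = step_all({0}, {}, toy_step)
contexts2, store2 = step_all(contexts1, store1, toy_step)
```

Each round now performs one join per logged change rather than one join of two full stores per transition.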

5.4 Timestamping an imperative store 

Thus far, we have made our optimizations in a purely functional 
manner. For the next series of optimizations, we need to dip into the 
imperative. We can motivate this entire sequence of optimizations 
by focusing on the largest bottleneck in the current state-space ex- 
ploration algorithm: checking to see if a state has already been seen. 
Given two states, checking equality is expensive because the stores 
within each are large, and every entry must be checked against 
every other. Hashes can sometimes rule out inequality relatively 
quickly, but the incidence of collisions and actual equality is costly. 

There is a better way. Shivers' original work on k-CFA
was susceptible to the same problem, and he suggested three
complementary optimizations: (1) make the store global; (2) update the
store imperatively; and (3) associate every change in the store with
a version number, its timestamp. Then, put timestamps in states
where previously there were stores. Given two states, the analysis
can now compare their stores just by comparing their timestamps,
a constant-time operation.

There is a subtle loss of precision in Shivers' original time- 
stamp technique that we can fix. For a given abstract state, all 
writes to the global store need to be delayed until the analysis 
considers all branches from that abstract state. This avoids cross- 
branch pollution which would not otherwise happen, e.g., when one 
branch writes to address a and another branch reads from address 
a. Fortunately, given our conversion to log-based stores this change 
is straightforward. 



eval(e) :=
  σ, todo, seen, T := ∅, ∅, [], 0
  ⟦e⟧^ε(∅, ∅, ε, mt)
  while (true):
    if todo = ∅: return (keys(seen), σ)
    else:
      let old := todo
      todo, ξ := ∅, ε
      foreach c ∈ old: Δ? := false; c()
      unless ξ = ε: T += 1; replay!(ξ, σ)

Figure 8. Imperative algorithm



At this point, we can begin to think of the analysis as an imper- 
ative algorithm with six major components: 

• $\sigma$, the store;

• todo, the workset;

• seen, a map from states to the timestamps at which they were
last seen;

• $\xi$, the store changes for the current step;

• $\Delta?$, a boolean tracking whether the stepped state contributed a
store change; and

• $T$, the timestamp of the store.

The new eval calculation is defined in figure 8. To ensure
termination, we guard against adding $c$ to todo by the following
check:

$$
\Delta? \lor seen(c) \neq T
$$

If it succeeds, todo gets $c$ and we update seen to map $c$ to $T$.

After all steps complete, we apply $\xi$ to $\sigma$ imperatively (with
replay!) and increase $T$ as long as there was at least one change
in $\xi$. This logic leads to termination if we know that each $(a, vs)$
in $\xi$ would change the value of $a$ in the current store. Thus, we
also guard additions to $\xi$ so that only updates that would change
the store are permitted. Each time $\xi$ is successfully extended, we
set $\Delta?$ to true. Before each individual step, we set it to false.
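
The six components and the two guards can be put together in a small Python sketch. The class and method names are assumptions for illustration, and `step(analyzer, state)` is a hypothetical callback that calls `write` and `visit` in place of producing successor states directly.

```python
class Analyzer:
    def __init__(self):
        self.store = {}        # the global store
        self.todo = set()      # the workset
        self.seen = {}         # state -> timestamp at which it was last seen
        self.log = []          # store changes for the current round
        self.changed = False   # did the state being stepped log a change?
        self.time = 0          # timestamp of the store

    def write(self, addr, values):
        """Log an update only if it would actually change the store."""
        values = frozenset(values)
        if not values <= self.store.get(addr, frozenset()):
            self.log.append((addr, values))
            self.changed = True

    def visit(self, state):
        """Queue a state unless it was already seen at this timestamp."""
        if self.changed or self.seen.get(state) != self.time:
            self.seen[state] = self.time
            self.todo.add(state)

    def run(self, initial, step):
        self.visit(initial)
        while self.todo:
            old, self.todo, self.log = self.todo, set(), []
            for state in old:
                self.changed = False
                step(self, state)
            if self.log:       # writes are delayed until the round ends
                self.time += 1
                for addr, values in self.log:
                    old_values = self.store.get(addr, frozenset())
                    self.store[addr] = old_values | values
        return set(self.seen), self.store


def toy_step(analyzer, state):
    # Hypothetical machine: state 0 writes to "a"; state 1 reads from "a".
    if state == 0:
        analyzer.write("a", {1})
        analyzer.visit(1)
    elif state == 1:
        for v in analyzer.store.get("a", frozenset()):
            analyzer.visit(10 + v)


states, store = Analyzer().run(0, toy_step)
```

Because the write from state 0 is only replayed at the end of its round, state 1 observes it one round later, which mirrors the delayed-write discipline described above.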

5.5 Pre-allocating the store 

Internally, the algorithm at this stage uses hash tables to model the 
store. This is because stores used to be distributed to all states, 
which required a compact, dynamic representation. But, such a 
dynamic structure isn't necessary when we know the structure of 
the store in advance: we know all possible entries, and we know its 
maximum size. 

In a monovariant analysis, the domain of the store is exactly the 
set of expressions in the program. If we label each expression with a 
unique natural, the analysis can index directly into the store without 
a hash or a collision. Even for polyvariant analyses, it is possible 
to compute the maximum number of addresses and similarly pre- 
allocate either the spine of the store or (if memory is no concern) 
the entire store. 
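As an illustration of the monovariant case, the following Python
sketch pre-allocates one slot per expression label and indexes it
directly. The names make_store and join are hypothetical, and the
sketch assumes expressions have already been labeled with the naturals
0 to n-1.

```python
def make_store(num_labels):
    """Pre-allocate one slot per expression label (monovariant case)."""
    return [frozenset() for _ in range(num_labels)]


def join(store, label, values):
    """Join values into the slot for label; report whether it grew."""
    old = store[label]
    new = old | values
    store[label] = new      # direct index: no hashing, no collisions
    return new != old


store = make_store(3)
grew_first = join(store, 1, frozenset({"Number"}))
grew_again = join(store, 1, frozenset({"Number"}))
```

The boolean result of join is the same "would this update change the
store" test used to guard additions to the change log.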

5.6 Abstracting literal compound data 

The abstraction of compound data structures like lists has deep
implications for precision and performance. The classic list-heavy
benchmark boyer, a simplified implementation of the Boyer-Moore
theorem prover, drove our own considerations for abstractions of
compound data structures. The benchmark's heavy use of literal lists
dooms the "natural" abstraction of lists to fail from over-precision.
In short, if we interpret a literal list '(a b c) as (cons 'a (cons
'b (cons 'c '()))), even a monovariant allocation strategy for
abstract cons cells will create exactly a 3-cell list.

If code contains a length-300 literal list, then its analysis yields a 
length-300 abstract list as well. In many cases, the specific contents 
of the list add little precision to the resulting analysis, and yet the 
analyzer will dutifully execute recursive functions over the entirety 
of these structures at every encounter. An alternative in this case is 
to explicitly flatten literal lists into single abstract cells. 

We explore both options here. We first explain the natural
abstraction of tuples, and then a less precise allocation strategy
from uniform k-CFA, which we use for large compound data literals
and which we modified from the implementation in [32].

5.6.1 Option 1: The natural abstraction 

The uniform way AAM approaches a simple abstraction strategy 
is to cut recursion out of the data definition by tying the recursive 
knot through the abstract store. For Scheme, the grammar for values 
looks like the following: 

Value ::= #t | #f | (cons Value Value) | '() | ...

Upon evaluating a cons application, instead of nesting values
directly, we allocate two addresses a and d, join the argument values
into the store at those addresses, and return the flattened
(cons a d) value. Since these addresses are all distinguished at
different syntactic call sites in the uniform k-CFA allocation
strategy, and quoted lists are sugar for a sequence of calls to cons,
this abstraction explodes the value space. Analyzing a function that
counts the number of atoms in a literal s-expression would actually
interpret that function at least that number of times (more because
of intermediate conses). Indeed, even in our fastest implementation,
this abstraction causes the analysis of boyer to be 430 times slower
than the approach we will now describe.
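The blow-up can be seen in a small Python sketch of the natural
abstraction. The function names and the representation of addresses
as (site, field) pairs are assumptions made for illustration; each
element of a literal list is treated as its own syntactic cons site.

```python
NIL = "'()"


def join(store, addr, values):
    """Join a set of abstract values into the store at addr."""
    store[addr] = store.get(addr, frozenset()) | frozenset(values)


def abstract_cons(site, car_values, cdr_values, store):
    """Allocate addresses a and d for this site and return (cons a d)."""
    a, d = (site, "car"), (site, "cdr")
    join(store, a, car_values)
    join(store, d, cdr_values)
    return ("cons", a, d)


def desugar_literal(elements, literal_site, store):
    """Treat a quoted list as sugar for nested calls to cons."""
    result = NIL
    for i in reversed(range(len(elements))):
        # Every position in the literal is a distinct allocation site.
        result = abstract_cons((literal_site, i), {elements[i]},
                               {result}, store)
    return result


def chain_length(value, store):
    """Count cells along the cdr chain (each cdr holds one value here)."""
    n = 0
    while value != NIL:
        n += 1
        (value,) = store[value[2]]
    return n


store = {}
abc = desugar_literal(["a", "b", "c"], "lit0", store)
cells = chain_length(abc, store)
```

A length-3 literal produces three distinct abstract cells and six
store addresses; a length-300 literal would produce 300 cells in the
same way.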

5.6.2 Option 2: A precise yet compact abstraction 

In typical Scheme programs, syntactic uses of cons are fewer than
implicit uses via literal lists. We use the above abstraction for
these syntactic uses, but choose not to always interpret literal
lists as sugar for cascading conses. In particular, if a list literal
is "too big" (in our case length > 1), we interpret the list as a
circular data structure: the right address points back to the cons
itself, and the left address points to all of the elements of the
list. We take this further and join the list elements together with a
coarse type-based abstraction. In effect, large lists or vectors of
literal numbers become unbounded lists/vectors of "number."
Heterogeneous data is combined to just "data," rather than a union of
"number", "string", etc., since primitives are interpreted on every
combination of values that flow to them, and unions lead to more
primitive interpretations.

Quotation is special because it cannot introduce function val- 
ues, which is important to enhancing our technique's soundness to 
conceptual complexity ratio. If we join two values of different type 
together, we don't get "anything," which has complex meaning in 
higher-order languages (Shivers' escape semantics) and is overly 
approximate. We instead get "any quotable value," which has much 
simpler semantics. 

There are a few steps to consider: 

• Define special value lattice elements for compound data domains
that can be quoted (e.g., QPair for {(cons a_i d_i)}_i, QVector for
immutable vectors, etc.)

• Define a "larger" value lattice element for all quotable data,
QData

• Interpret (quote (a ...)) as (qlist a ...), a new primitive
function defined in Figure 9, with similar definitions for immutable
vectors.



Δ(σ, qlist)          = ('(), σ)
Δ(σ, qlist, v...+)   = ((cons a d), σ')
  where σ' = σ ⊔ [a ↦ ⨆(v...)]
               ⊔ [d ↦ {'(), (cons a d)}]

⨆(v)                 = v
⨆(v, vs...)          = merge(v, ⨆(vs...))

merge(v, v)                      = v
merge(n, m)                      = Number
merge(n, v)                      = QData
merge((cons a d), (cons a' d'))  = QPair
merge(QPair, (cons a d))         = QPair
merge(QPair, v)                  = QData

Figure 9. Quoted list primitive 



Figure 10. Overview performance comparison between baseline 
and optimized analyzer (entries of t mean timeout, and m mean out 
of memory). 

• Extend the Δ axioms to include conservative meaning for these
new values (e.g., (car QData) = QData, (add1 QData) = Number
and log "possible type error") and allow them to allocate addresses
and change the store
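The qlist primitive and the merge rules of Figure 9 can be sketched in
Python as follows. This is an illustrative sketch, not the
implementation evaluated here. Several choices are assumptions: the
abstract Number is treated as matching the numeric pattern n, any
combination not covered by an earlier rule falls through to QData, and
addresses are (site, field) pairs.

```python
NIL = "'()"
NUMBER, QPAIR, QDATA = "Number", "QPair", "QData"


def is_number(v):
    """Concrete numbers and the abstract Number (an assumption)."""
    return v == NUMBER or (isinstance(v, (int, float))
                           and not isinstance(v, bool))


def is_pair(v):
    """Flattened (cons a d) values and the abstract QPair."""
    return v == QPAIR or (isinstance(v, tuple) and len(v) == 3
                          and v[0] == "cons")


def merge(v, w):
    if v == w:
        return v
    if is_number(v) and is_number(w):
        return NUMBER
    if is_pair(v) and is_pair(w):
        return QPAIR
    return QDATA            # any other mix becomes "any quotable value"


def join_all(values):
    """Right fold: join(v, vs...) = merge(v, join(vs...))."""
    acc = values[-1]
    for v in reversed(values[:-1]):
        acc = merge(v, acc)
    return acc


def qlist(store, site, *values):
    """Return (value, new_store) for a quoted list allocated at site."""
    if not values:
        return NIL, store
    a, d = (site, "car"), (site, "cdr")
    cell = ("cons", a, d)
    new_store = dict(store)
    # The car address holds the single joined element.
    new_store[a] = new_store.get(a, frozenset()) | {join_all(values)}
    # The cdr address is circular: empty list or the cell itself.
    new_store[d] = new_store.get(d, frozenset()) | {NIL, cell}
    return cell, new_store


cell, numbers = qlist({}, "lit0", 1, 2, 3)
_, mixed = qlist({}, "lit1", 1, "x")
```

Here a literal list of numbers collapses to one cell whose car holds
only Number, and a list mixing a number with a string collapses to a
cell whose car holds only QData.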

6. Evaluation 

We have implemented, optimized, and evaluated an analysis framework
supporting higher-order functions, state, first-class control,
compound data, and a large number of primitive kinds of data
and operations such as floating point, complex, and exact rational
arithmetic. The analysis is evaluated against a suite of benchmarks
drawn from the literature. For each benchmark, we collect analysis
times, peak memory usage, and the rate of states-per-second explored
by the analysis for each of the optimizations discussed in
Section 5, cumulatively applied. The analysis is stopped after
consuming 30 minutes of time or 1 gigabyte of space. When presenting
relative numbers, we use the timeout limits as a lower bound
on the actual time required, thus giving a conservative estimate of
improvements.
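The conservative treatment of timeouts amounts to the following
arithmetic, sketched in Python. The function name and the sample
timings are hypothetical and are not measurements from our
benchmarks.

```python
TIMEOUT_SECONDS = 30 * 60   # the 30-minute limit


def improvement_factor(baseline_seconds, optimized_seconds):
    """Return (factor, is_lower_bound).

    baseline_seconds is None when the baseline run timed out; the
    limit then stands in for the unknown true time, so the factor is
    only a lower bound on the real improvement.
    """
    timed_out = baseline_seconds is None
    effective = TIMEOUT_SECONDS if timed_out else baseline_seconds
    return effective / optimized_seconds, timed_out


# Hypothetical timings, for illustration only.
bounded = improvement_factor(None, 45.0)
measured = improvement_factor(900.0, 3.0)
```

With these made-up inputs, a baseline that timed out against a
45-second optimized run is reported as at least a 40-fold improvement.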

All benchmark results are averaged over 5 runs, done in parallel,
on a 12-core, 64-bit Intel Xeon machine running at 2.40 GHz with
12 GB of memory.

Many benchmarks cause the baseline analyzer to take longer
than 30 minutes or to consume more than 1 gigabyte of memory, at
which point the analysis is stopped. This is the case for the largest
benchmark program, which is 3,500 lines of code and takes under a
minute in the most optimized analyzer. For those benchmarks that
did complete on the baseline, the optimized analyzer outperformed
the baseline by two to three orders of magnitude.

We use the following set of benchmarks: 

1. nucleic: a floating-point intensive application taken from
molecular biology that has been used widely in benchmarking
functional language implementations [12] and analyses (e.g.,
[13, 32]). It is a constraint satisfaction algorithm used to
determine the three-dimensional structure of nucleic acids.

2. matrix: tests whether a matrix is maximal among all matrices
of the same dimension obtainable by simple reordering of rows
and columns and negation of any subset of rows and columns.
It is written in continuation-passing style (used in [13, 32]).

3. nbody: implements the Greengard multipole algorithm for
computing gravitational forces on point masses distributed
uniformly in a cube (used in [13, 32]).

4. earley: Earley's parsing algorithm, applied to a 15-symbol
input according to a simple ambiguous grammar. A real program,
applied to small data whose exponential behavior leads to a
peak heap size of half a gigabyte or more during concrete
execution.

5. maze: generates a random maze using Scheme's call/cc
operation and finds a path solving the maze (used in [13, 32]).

6. church: tests distributivity of multiplication over addition for
Church numerals (introduced by [30]).

7. lattice: enumerates the order-preserving maps between two
finite lattices (used in [13, 32]).

8. boyer: a term-rewriting theorem prover (used in [13, 32]).

9. mbrotZ: generates a Mandelbrot set fractal using complex
numbers.

10. graphs: counts the number of directed graphs with a
distinguished root and k vertices, each having out-degree at most 2.
It is written in a continuation-passing style and makes extensive
use of higher-order procedures: it creates closures almost as
often as it performs non-tail procedure calls (used by [13, 32]).

Figure 10 gives an overview of the benchmark results in terms
of absolute time, space, and speed between the baseline and most
optimized analyzer. Figure 11 plots the factors of improvement over
the baseline for each optimization step. The dip we see in transition
rate even though time taken decreases is to be expected: fewer
"easy" states are added by abstract compilation. It increases again
with the introduced algorithmic improvements. Accumulating store
changes in addition to maintaining the store accounts for the higher
memory usage when using the store delta technique without further
improvements.

Source code of the implementation and benchmark suite is at: 

https://github.com/dvanhorn/oaam

Comparison with other flow analysis implementations The analysis
considered here computes results similar to Earl et al.'s 0-CFA
implementation [9], which times out on the Vardoulakis and Shivers
benchmark because it does not widen the store as described for
our baseline evaluator. So even though it offers a fair point of
comparison, a more thorough evaluation is probably uninformative as
the other benchmarks are likely to time out as well (and it would
require significant effort to extend their implementation with the
features needed to analyze our benchmark suite). That implementation
is evaluated against much smaller benchmarks: the largest
program is 30 lines.

Vardoulakis and Shivers evaluate their CFA2 analyzer [30]
against a variant of 0-CFA defined in their framework, and the
example we draw on is the largest benchmark Vardoulakis and Shivers
consider. More work would be required to scale the analyzer to the
set of features required by our benchmarks.

The only analyzers we were able to find that proved capable of
analyzing the full suite of benchmarks considered here were the
Soft Typing system of Wright and Cartwright [31] and, in many
ways its successor, the polymorphic splitting system of Wright and
Jagannathan [32].¹ Unfortunately, these analyses compute an
inherently different and incomparable form of analysis. Consequently,
we have omitted a complete comparison with these implementations.
The AAM approach provides more precision in terms of
temporal ordering of program states, which comes at a cost that
can be avoided in constraint-based approaches. Consequently,
implementation techniques cannot be "ported" between these two
approaches. However, our optimized implementation is within an
order of magnitude of the performance of Wright and Jagannathan's
analyzer. Although we would like to improve this to be more
competitive, the optimized AAM approach still has many strengths to
recommend it in terms of precision, ease of implementation and
verification, and rapid design.

7. Related work 

Abstracting Abstract Machines This work closely follows
Van Horn and Might's original papers on abstracting abstract
machines [27, 29], which in turn are one piece of the large body
of research on flow analysis for higher-order languages (see
Midtgaard [18] for a thorough survey). The AAM approach sits at the
confluence of two major lines of research: (1) the study of abstract
machines [16] and their systematic construction [25], and (2) the
theory of abstract interpretation [3, 4].

Frameworks for flow analysis of higher-order programs Besides
the original AAM work, the analysis most similar to that presented
in Section 3 is the infinitary control-flow analysis of Nielson
and Nielson [23] and the unified treatment of flow analysis by
Jagannathan and Weeks [14]. Both are parameterized in such a way
that in the limit, the analysis is equivalent to an interpreter for the
language, just as is the case here. What is different is that both give
a constraint-based formulation of the abstract semantics rather than
a finite machine model.

Abstract compilation Boucher and Feeley [1] introduced the idea
of abstract compilation, which used closure generation [10] to
improve the performance of control flow analysis. We have adapted
the closure generation technique from compositional evaluators to
abstract machines and applied it to similar effect.

Constraint-based program analysis for higher-order languages 

Constraint-based program analyses (e.g., [17, 23, 32]) typically
compute sets of abstract values for each program point. These values
approximate values arising at run-time for each program point.
Value sets are computed as the least solution to a set of (inclusion or
equality) constraints. The constraints must be designed and proved
as a sound approximation of the semantics. Efficient implementations
of these kinds of analyses often take the form of worklist-based
graph algorithms for constraint solving, and are thus quite
different from the interpreter implementation. The approach thus
requires effort in constraint system design and implementation, and
the resulting system requires verification effort to prove that the
constraint system is sound and that the implementation is correct.

This effort increases substantially as the complexity of the
analyzed language increases. The work of maintaining both the
concrete semantics and the constraint system (and the relations
between them) must be scaled simultaneously. However, constraint
systems, which have been extensively studied in their own right,
enjoy efficient implementation techniques and can be expressed in
declarative logic languages that are heavily optimized [2].
Consequently, constraint-based analyses can be computed quickly. For
example, Jagannathan and Wright's polymorphic splitting
implementation [32] analyzes the Vardoulakis and Shivers benchmark
about 25 times faster than the fastest implementation considered
here. These analyses compute very different things, so the
performance comparison is not apples-to-apples.

¹ This is not a coincidence; these papers set a high standard for
evaluation, which we consciously aimed to approach.

Figure 11. Factors of improvement over baseline for each step of
optimization (bigger is better). Panels: (a) peak memory usage,
(b) rate of state transitions, (c) total analysis time.

The AAM approach, and the state transition graphs it generates,
encodes temporal properties not found in classical constraint-based
analyses for higher-order programs. These analyses (ultimately)
compute judgments on program terms and contexts, e.g., at
expression e, variable x may have value v. The judgments do not
relate the order in which expressions and contexts may be evaluated
in a program, e.g., they have nothing to say with regard to questions
like, "Do we always evaluate e1 before e2?" or "Is it always the case
that a file handle is opened, read and then closed in that order?"
The state transition graphs can answer these kinds of queries, but
this does not come for free: respecting temporal order imposes an
order in which states and terms may be evaluated during the analysis.

We view the primary contribution of this work as a systematic 
path that eases the design, verification, and implementation of 
analyses using the abstracting abstract machine approach to within 
a factor of performant constraint-based analyses. 

8. Conclusion 

Abstract machines are not only a good model for rapid analysis
development; they can also be systematically developed into efficient
algorithms.

Acknowledgments We thank Suresh Jagannathan for providing 